All Questions
1 question
1 vote
0 answers
57 views
Can I reduce computation by only predicting response tokens in a transformer and still get the same gradients?
I have been looking at the source code of the Stanford Alpaca model and I believe that during training, the whole instruction + response data is fed into the model normally. Then the instruction part ...
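To make the setup concrete, here is a minimal sketch. It assumes the truncated sentence above refers to excluding instruction tokens from the loss by giving them a sentinel label. The names `IGNORE_INDEX`, `build_labels`, and `masked_cross_entropy` are hypothetical and used only for illustration. The sketch is plain Python rather than the Alpaca code itself, and it assumes the labels are already aligned with the positions that predict them.

```python
import math

# Assumed sentinel for "do not score this position" (hypothetical name).
IGNORE_INDEX = -100


def build_labels(instruction_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Mask every instruction position; keep response token ids as targets."""
    return [ignore_index] * len(instruction_ids) + list(response_ids)


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: one list of per-vocabulary scores for each position.
    labels: one target token id (or ignore_index) for each position.
    """
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy example: 2 instruction tokens, 2 response tokens, vocabulary of size 3.
labels = build_labels([5, 6], [1, 2])
logits = [
    [0.1, 0.2, 0.3],  # instruction position (masked)
    [1.0, 0.0, -1.0],  # instruction position (masked)
    [0.0, 2.0, 0.0],  # response position, target id 1
    [0.5, 0.5, 3.0],  # response position, target id 2
]
loss = masked_cross_entropy(logits, labels)
```

In this sketch the logits at masked positions never enter the loss, so changing them leaves `loss` unchanged. The instruction tokens still have to pass through the model, though, because in a causal transformer the hidden states at the response positions depend on them through attention.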